Dyr og Data

Statistical thinking — generalized linear models

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2024-10-29

Introduction

Introduction

A Gaussian linear model describes the expected value (mean) of the response as a linear combination of the predictor variables

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_j x_{ij} + \varepsilon_i\]

Implies that a constant change in a predictor leads to a constant change in the response

This is intuitive for a response that has a normal distribution: it can vary effectively indefinitely in either direction, or over a small range (e.g. heights)

This is inappropriate for a wide range of types of response variables common in animal science data sets

Introduction

Animal science data are often discrete or categorical

Binary (dichotomous) variables

  • Suffering from a particular disease
  • Obese or normal weight

Count variables

  • number of disease recordings per herd
  • number of discrete play behaviours observed

Also continuous, positive (or non-negative) data such as concentrations of particular compounds in blood

Logistic Regression

Logistic Regression

A binary or dichotomous variable can take only one of two values

Numerically these are coded as 0 and 1 when we fit a logistic regression

  • a value of 1 is generally known as the success state; more usefully it indicates cases where the risk factor or outcome of interest is present,
  • a value of 0 is the failure state; more usefully, the absence of the risk factor or outcome of interest

Alternatively, we may have data representing a proportion of successes from some total in a set of \(m\) trials or experiments

Such variables are strictly bounded at 0 and 1

Logistic Regression

Association between high somatic cell count (SCC) & milk yield in dairy cows

Define high SCC as SCC > 125,000 cells ml\(^{-1}\)

A linear regression line could predict negative probabilities of high SCC

Variance unlikely to be constant; reduced at 0 & 1

scc <- read_csv("data/itve-milk-scc/milk-scc.csv") |>
  rename(
    milk_yield_kg = kgmilk,
    cell_count = cellcount,
    cow_id = cowid
  ) |>
  mutate(
    high_scc = cell_count > 125, # cell counts are recorded in 1,000s of cells per ml
    breed = factor(
      breed,
      levels = c(1, 2, 3, 8),
      labels = c("Red Danish", "Holstein", "Jersey", "Cross-breed")),
    parity = factor(parity))

Logistic regression

p1 <- ggplot(scc, aes(x = milk_yield_kg, y = as.numeric(high_scc))) +
    geom_jitter(position = position_jitter(height = 0.05)) +
    ylab("High somatic cell count?") + xlab("Milk yield (kg)")
p1 + geom_smooth(method = "lm", se = FALSE)

Logistic Regression

Would like to fit an s-shaped curve to the data but we want a linear equation

Can achieve both aims via a transformation of the s-shaped response to a straight line: logit transformation

Logistic Regression

The logistic regression is given by the following equation

\[\log \left ( \frac{p}{1 - p} \right ) = \beta_0 + \beta_{1} x_1\]

\(\displaystyle \frac{p}{1 - p}\) is the odds, a measure that compares the probability of an event happening with the probability that it does not happen

If \(p\) = 0.25 the odds are \(\frac{0.25}{1 - 0.25} = \frac{1}{3}\) or 1 success for every 3 failures

If \(p\) = 0.8 the odds are \(\frac{0.8}{1 - 0.8} = 4\) or 4 successes for each failure

The logit is the natural log of the odds; odds are a non-negative number, but log odds can be any real number
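A minimal sketch of the logit transformation and its inverse, using the two probabilities above (R also provides these built in as `qlogis()` and `plogis()`):

```r
# logit: probability -> log odds; inv_logit: log odds -> probability
logit <- function(p) log(p / (1 - p))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

logit(0.25)            # log(1/3): 1 success for every 3 failures
logit(0.8)             # log(4): 4 successes for each failure
inv_logit(logit(0.8))  # the inverse recovers 0.8
```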

Logistic Regression

\[\log \left ( \frac{p}{1 - p} \right ) = \beta_0 + \beta_{1} x_1\]

\(\beta_0\) a constant term; predicted log-odds of success (high SCC) where \(x\) is 0 (0 milk yield)

\(\beta_1\) is the slope; predicted change in the log-odds of success (high SCC) for a 1 unit increase in \(x\) (an additional kg of milk per day)

Because log-odds can take any value

  • \(\beta_1 \mathsf{> 0}\); increasing \(x\) associated with an increase in probability of success
  • \(\beta_1 \mathsf{< 0}\); increasing \(x\) associated with a decrease in probability of success
  • \(\beta_1 \mathsf{= 0}\); increasing \(x\) has no effect on probability of success

Logistic Regression

A method known as maximum likelihood (ML) is used to estimate values of \(\beta_0\) and \(\beta_j\) from the data, the maximum likelihood estimates (MLEs)

So called because these estimates are the ones that make the data most likely under the model

In linear regression we minimised the residual sum of squares

With GLMs we minimise the deviance

Can compare likelihoods of two models using a likelihood ratio test
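The likelihood ratio test compares the drop in deviance between two nested models to a chi-squared distribution. A sketch by hand, using the deviances from the `anova()` output for the high-SCC models below:

```r
# deviance drop between nested models ~ chi-squared with df = difference
# in number of parameters (values taken from the anova() output below)
dev_null <- 1577.2   # intercept-only model
dev_model <- 1562.2  # model with milk yield
pchisq(dev_null - dev_model, df = 1, lower.tail = FALSE)  # ~ 1e-04
```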

Logistic Regression

m <- glm(high_scc ~ milk_yield_kg, data = scc, family = binomial)
m0 <- glm(high_scc ~ 1, data = scc, family = binomial)
anova(m0, m, test = "LRT")
Analysis of Deviance Table

Model 1: high_scc ~ 1
Model 2: high_scc ~ milk_yield_kg
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      1139     1577.2                          
2      1138     1562.2  1   15.056 0.0001044 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that compared to the null model of no change in risk of high SCC, the milk yield of a cow has a statistically significant effect on the probability of high SCC

Logistic Regression

printCoefmat(coef(summary(m)), digits = 4)
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.819109   0.195370   4.193 2.76e-05 ***
milk_yield_kg -0.035740   0.009306  -3.841 0.000123 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
round(exp(coef(m)), 3)
  (Intercept) milk_yield_kg 
        2.268         0.965 

Estimate contains the estimates for \(\beta_j\) on the log-odds scale; these are easier to interpret on the odds scale

The odds of high SCC for a non-yielding cow are 2.268

\(\beta_1\) is negative, so higher-yielding cows are associated with decreased risk of high SCC; for each additional kg of yield, the odds of high SCC are multiplied by 0.965
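The interpretation can be reproduced by hand from the coefficients reported above:

```r
# fitted coefficients from the summary above
b0 <- 0.819109   # intercept: log-odds of high SCC at 0 kg yield
b1 <- -0.035740  # slope: change in log-odds per additional kg

exp(b0)              # ~ 2.268, odds of high SCC for a non-yielding cow
exp(b1)              # ~ 0.965, multiplicative change in odds per extra kg
plogis(b0 + b1 * 20) # ~ 0.526, probability of high SCC at a yield of 20 kg
```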

Odds ratio

Odds ratio of high SCC comparing milk_yield_kg of 25 kg against 20 kg

exp(-0.036 * (25 - 20))
[1] 0.8352702
m |> comparisons(variables = list(milk_yield_kg = c(20, 25)),
  comparison = "lnoravg", transform = "exp")

 Estimate Pr(>|z|)    S 2.5 % 97.5 %
    0.836   <0.001 13.0 0.763  0.916

Term: milk_yield_kg
Type:  response 
Comparison: ln(odds(25) / odds(20))
Columns: term, contrast, estimate, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 

Risk ratio

A risk ratio or relative risk is the ratio of the risk under two conditions

  • with or without a risk factor.
  • increased milk yield from 20 to 25 kg

\[ \text{risk} = \frac{\text{number of cases}}{\text{number at risk}} \]

\[ \text{risk ratio} = \frac{\text{risk}_a}{\text{risk}_b} \]
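The two formulas can be illustrated with a toy calculation (the counts here are hypothetical, for illustration only):

```r
# risk = number of cases / number at risk, under each condition
risk_a <- 30 / 100   # 30 cases among 100 at risk with the risk factor
risk_b <- 10 / 100   # 10 cases among 100 at risk without it

risk_a / risk_b      # risk ratio of 3: the factor triples the risk
```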

Risk ratio

m |> avg_predictions(variables = list(milk_yield_kg = c(20, 25)),
  type = "response")

 milk_yield_kg Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
            20    0.526     0.0149 35.3   <0.001 906.2 0.497  0.555
            25    0.481     0.0189 25.5   <0.001 472.4 0.444  0.518

Type:  response 
Columns: milk_yield_kg, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
m |> comparisons(variables = list(milk_yield_kg = c(20, 25)),
  comparison = "lnratioavg", transform = "exp")

 Estimate Pr(>|z|)    S 2.5 % 97.5 %
    0.915   <0.001 11.9 0.873   0.96

Term: milk_yield_kg
Type:  response 
Comparison: ln(mean(25) / mean(20))
Columns: term, contrast, estimate, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 

Logistic Regression

Fitted functions on scale of \(\eta\) and response

printCoefmat(coef(summary(m)), digits = 4)
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.819109   0.195370   4.193 2.76e-05 ***
milk_yield_kg -0.035740   0.009306  -3.841 0.000123 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tests of \(H_0: \; \beta_j = 0\) are done using a Wald statistic instead of the \(t\) test

\[z = \frac{\beta_j - 0}{\mathsf{SE}(\beta_j)}\]

Rather than following a \(t\) distribution, Wald statistics are approximately standard normal; the \(p\)-value is the probability of seeing a \(z\) as extreme as the one observed under a standard normal distribution
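The Wald test can be checked by hand from the milk-yield estimate and standard error in the coefficient table above:

```r
# Wald z = estimate / standard error, compared to a standard normal
z <- -0.035740 / 0.009306
z                   # ~ -3.841, as reported in the coefficient table

2 * pnorm(-abs(z))  # ~ 0.000123, the two-sided p-value
```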

Logistic Regression

Fitted functions on the scale of \(\eta\) and the response

lr_p1 <-  m |> plot_predictions(by = "milk_yield_kg", type = "link") +
  labs(y = "Log odds", x = "Milk yield (kg)")
lr_p2  <- m |> plot_predictions(by = "milk_yield_kg", type = "response") +
  labs(y = "Probability of high SCC", x = "Milk yield (kg)")
lr_p1 + lr_p2

Logistic Regression

new_df <- data.frame(milk_yield_kg = seq(3, 100, by = 1))
lr_p1 <-  m |> plot_predictions(by = "milk_yield_kg",
  type = "link", newdata = new_df) +
  labs(y = "Log odds", x = "Milk yield (kg)")
lr_p2  <- m |> plot_predictions(by = "milk_yield_kg",
  type = "response", newdata = new_df) +
  labs(y = "Probability of high SCC", x = "Milk yield (kg)")
lr_p1 + lr_p2

Generalised Linear Models

Generalised Linear Models

Generalised Linear Models (GLMs) are a powerful extension to the linear regression model, extending the types of data & conditional distributions that can be modelled beyond the normal or Gaussian distribution of linear regression

Binary (dichotomous) variables can be modelled using logistic regression

Count data can be modelled using Poisson regression

These are two special cases of the broad class of GLMs

Also includes the linear regression as a special case

Generalised Linear Models

Logistic regression is a special case of the GLM where the conditional distribution of the response is assumed to be binomial

GLMs allow the conditional distribution of the response to be any distribution from the exponential family; Poisson, binomial, Gaussian, gamma, multinomial, …

There are three parts to a GLM

  • The conditional distribution of \(y\)
  • The linear predictor \(\eta\), and
  • The link function

Whilst this affords a degree of choice, natural selections for the conditional distribution of \(y\) and the link function often arise from the type of data being modelled

Generalised Linear Models

In a GLM we want a model for the expectation of \(y\), \(\mathsf{E}(y_i)\), which is commonly abbreviated to \(\mu_i\)

We might model \(\mu_i\) as following a Poisson distribution if the data were counts, or a binomial distribution in the case of dichotomous data, as we did in the high SCC example

We need to decide which predictor variables and any transformations of them should be used to predict \(y\); this is the linear predictor, \(\eta\)

\[\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_j x_{ij}\]

Finally we need to map the values from the response scale on to a linear scale, just as we mapped from probabilities to log-odds using the logit transformation. This is the link function, \(g()\)
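In R, each family object carries its canonical (default) link, so the distribution and link pair up naturally when fitting with `glm()`:

```r
# default link for each family used in this lecture
binomial()$link  # "logit"    -> logistic regression
poisson()$link   # "log"      -> Poisson regression
gaussian()$link  # "identity" -> ordinary linear regression
```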

Poisson GLM

Poisson GLM

Often, we’ll encounter count data of some sort; number of cases of a disease in an epidemic

Counts are strictly non-negative; can’t have a count of less than 0

Variance increases as a function of the mean (lambda; \(\lambda\))

Poisson GLM

The Poisson distribution is a standard distribution for modelling count data

The Poisson gives the distribution of the number of things (individuals, counts, events) in a given sampling interval/effort if each event is independent

The Poisson is a simple distribution; defined by a single parameter \(\lambda\), which is the mean and the variance

The standard link function for a Poisson GLM is the log link; maps non-negative counts on to a linear scale
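The mean-variance relationship is easy to see by simulation:

```r
# draw a large Poisson sample with lambda = 4; mean and variance should agree
set.seed(42)
y <- rpois(100000, lambda = 4)
mean(y)  # ~ 4
var(y)   # ~ 4: the variance tracks the mean
```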

Poisson GLM

\[\begin{align*} y_i & \sim \text{Poisson}(\lambda_i) \\ \lambda_i & = \mathbb{E}(y_i) = \mu_i \\ \log(\lambda_i) & = \beta_0 + \beta_1 x_{i1} + \cdots \end{align*}\]

Poisson GLM: Example

Association between endometritis and breed of cattle in Danish dairy herds

endo <- tibble::tribble(
  ~ breed,    ~ endometritis, ~ count,
  "Holstein", "Yes", 100,
  "Holstein", "No", 400,
  "Jersey", "Yes", 10,
  "Jersey", "No", 190
) |>
mutate(
  breed = factor(breed),
  endometritis = factor(endometritis, levels = c("No", "Yes"))
)
endo
# A tibble: 4 × 3
  breed    endometritis count
  <fct>    <fct>        <dbl>
1 Holstein Yes            100
2 Holstein No             400
3 Jersey   Yes             10
4 Jersey   No             190

Poisson GLM: Example

Data like this are common in biology — contingency tables

              Endometritis
              Yes    No   Total
Holstein      100   400     500
Jersey         10   190     200
Total         110   590     700

A key question of interest is independence of the effects of variables on the cell frequencies
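The classical chi-squared test of independence on the same 2 × 2 table addresses this question directly, and is comparable to the GLM-based analysis that follows:

```r
# 2x2 table of endometritis by breed, from the counts above
tab <- matrix(c(100, 10, 400, 190), nrow = 2,
  dimnames = list(breed = c("Holstein", "Jersey"),
    endometritis = c("Yes", "No")))

# test of independence; strong evidence breed and endometritis are associated
chisq.test(tab, correct = FALSE)
```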

Poisson GLM: Example

Analysis of deviance table similar to sequential SS from linear regression

endo1 <- glm(count ~ breed + endometritis,
  data = endo, family = poisson(link = "log"))
anova(endo1, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: count

Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                             3     523.43              
breed         1   132.83         2     390.60 < 2.2e-16 ***
endometritis  1   361.54         1      29.05 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Poisson GLM: interaction

Add an interaction between breed and endometritis to the model & refit

endo2 <- update(endo1, . ~ . + breed:endometritis)
summary(endo2)

Call:
glm(formula = count ~ breed + endometritis + breed:endometritis, 
    family = poisson(link = "log"), data = endo)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  5.99146    0.05000 119.829  < 2e-16 ***
breedJersey                 -0.74444    0.08811  -8.449  < 2e-16 ***
endometritisYes             -1.38629    0.11180 -12.399  < 2e-16 ***
breedJersey:endometritisYes -1.55814    0.34317  -4.540 5.61e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:  5.2343e+02  on 3  degrees of freedom
Residual deviance: -3.9968e-14  on 0  degrees of freedom
AIC: 33.517

Number of Fisher Scoring iterations: 3
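In a log-linear model of a 2 × 2 table, the interaction coefficient is a log odds ratio, which we can check by hand against the cell counts:

```r
# log odds ratio from the cell counts: (Jersey Yes * Holstein No) /
# (Jersey No * Holstein Yes)
log((10 * 400) / (190 * 100))  # ~ -1.558, matching breedJersey:endometritisYes

exp(-1.55814)  # ~ 0.21: Jersey's odds of endometritis are about
               # a fifth of Holstein's
```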

Poisson GLM: interaction

Compare models using likelihood ratio test

anova(endo1, endo2, test = "LRT")
Analysis of Deviance Table

Model 1: count ~ breed + endometritis
Model 2: count ~ breed + endometritis + breed:endometritis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
1         1     29.054                         
2         0      0.000  1   29.054 7.04e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compare models using AIC

AIC(endo1, endo2)
      df      AIC
endo1  3 60.57105
endo2  4 33.51737

Butterfly colour

Mayekar & Kodandaramaiah (2017; PLoS One, 12, e0171482) studied determinants of pupal colour (green vs. brown) of a tropical butterfly

Response

  • Green vs Brown

Predictors

  • time to pupation
  • pupal weight
  • sex

Discuss in your groups:

  • what types of data each of the variables is?
  • what kind of GLM is appropriate?

Problematic grizzly bears

Moorehouse et al (2016; PLoS One, 11, e0165425) used genetic analyses to determine the parentage of bear cubs, and cross-classified cubs and their mothers by whether they caused problems around humans

Problematic grizzly bears

Table shows the number of cubs and mothers classified by whether they are problematic or not

                     Mother
                   Yes    No   Total
Offspring   Yes      5    18      23
            No       3    50      53
Total                8    68      76

Discuss in your groups:

  • what types of data each of the variables is?
  • what kind of GLM is appropriate?

Cat Owner’s Attitudes

Teng et al (2020; PLoS One, 15, e0234190) surveyed cat owners in Australia on factors that might affect the prevalence of overweight and obese cats

Body Condition Scores

  1. very underweight
  2. somewhat underweight
  3. ideal
  4. chubby/overweight
  5. fat/obese

Begging behaviour

  1. Never
  2. Sometimes
  3. Often
  4. Always

Cat Owner’s Attitudes

Begging?     BCS 1 & 2   BCS 3   BCS 4 & 5
Never               21     266          56
Sometimes           55     527         162
Often               14     106          69
Always              10      47          44

Discuss in your groups:

  • what types of data each of the variables is?
  • what kind of GLM is appropriate?